The road from manual to automatic semantic indexing of biomedical literature: a 10 years journey
Biomedical experts are facing challenges in keeping up with the vast amount of biomedical knowledge published daily. With millions of citations added to databases like MEDLINE/PubMed each year, efficiently accessing relevant information becomes crucial. Traditional term-based searches may lead to irrelevant or missed documents due to homonyms, synonyms, abbreviations, or term mismatch. To address this, semantic search approaches employing predefined concepts with associated synonyms and relations have been used to expand query terms and improve information retrieval. The National Library of Medicine (NLM) plays a significant role in this area, indexing citations in the MEDLINE database with topic descriptors from the Medical Subject Headings (MeSH) thesaurus, enabling advanced semantic search strategies to retrieve relevant citations despite the synonymy and polysemy of biomedical terms. Over time, advancements in semantic indexing have been made, with Machine Learning facilitating the transition from manual to automatic semantic indexing in the biomedical literature. The paper highlights the journey of this transition, starting with manual semantic indexing and the initial efforts toward automatic indexing. The BioASQ challenge has served as a catalyst in revolutionizing the domain of semantic indexing, further pushing the boundaries of efficient knowledge retrieval in the biomedical field.
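The query expansion idea mentioned in this abstract can be illustrated with a minimal sketch. It assumes a tiny hand-written thesaurus (`THESAURUS`, hypothetical and not real MeSH data) and hypothetical helper names `expand_query` and `search`; it is not the NLM or PubMed implementation.

```python
import re

# Hypothetical toy thesaurus: concept -> synonyms (illustrative, not real MeSH data).
THESAURUS = {
    "myocardial infarction": ["heart attack"],
    "neoplasms": ["cancer", "tumor", "tumour"],
}


def expand_query(query, thesaurus=THESAURUS):
    """Return the query term plus all terms of any concept it belongs to."""
    q = query.lower().strip()
    expanded = {q}
    for concept, synonyms in thesaurus.items():
        terms = {concept, *synonyms}
        if q in terms:
            expanded |= terms
    return expanded


def search(query, documents, thesaurus=THESAURUS):
    """Return indices of documents that contain any expanded term as a whole word."""
    terms = expand_query(query, thesaurus)
    hits = []
    for i, doc in enumerate(documents):
        text = doc.lower()
        if any(re.search(r"\b" + re.escape(t) + r"\b", text) for t in terms):
            hits.append(i)
    return hits


docs = [
    "Aspirin after heart attack",
    "Tumour growth in mice",
    "Diabetes management",
]
print(search("myocardial infarction", docs))
print(search("cancer", docs))
```

A plain term search for "myocardial infarction" would miss the first document; expanding the query through the shared concept is what retrieves it.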
Beyond MeSH: Fine-Grained Semantic Indexing of Biomedical Literature based on Weak Supervision
In this work, we propose a method for the automated refinement of subject
annotations in biomedical literature at the level of concepts. Semantic
indexing and search of biomedical articles in MEDLINE/PubMed are based on
semantic subject annotations with MeSH descriptors that may correspond to
several related but distinct biomedical concepts. Such semantic annotations do
not adhere to the level of detail available in the domain knowledge and may not
be sufficient to fulfil the information needs of experts in the domain. To this
end, we propose a new method that uses weak supervision to train a concept
annotator on the literature available for a particular disease. We test this
method on the MeSH descriptors for two diseases: Alzheimer's Disease and
Duchenne Muscular Dystrophy. The results indicate that concept-occurrence is a
strong heuristic for automated subject annotation refinement and its use as
weak supervision can lead to improved concept-level annotations. The
fine-grained semantic annotations can enable more precise literature retrieval,
sustain the semantic integration of subject annotations with other domain
resources and ease the maintenance of consistent subject annotations, as new
more detailed entries are added in the MeSH thesaurus over time.
Comment: 36 pages, 8 figures; Dictionary-based baselines added and conclusions
update
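The concept-occurrence heuristic described in this abstract can be sketched as a weak labelling function. This is a minimal illustration, not the authors' code: the function name `weak_labels` and the concept names and terms in `CONCEPT_TERMS` are hypothetical and do not reflect the actual MeSH concept structure.

```python
import re

# Hypothetical concepts grouped under one coarse descriptor, each with its terms.
CONCEPT_TERMS = {
    "early_onset_ad": ["early-onset Alzheimer disease", "presenile dementia"],
    "late_onset_ad": ["late-onset Alzheimer disease"],
}


def weak_labels(abstract, concept_terms=CONCEPT_TERMS):
    """Label each concept True if any of its terms occurs in the abstract."""
    text = abstract.lower()
    labels = {}
    for concept, terms in concept_terms.items():
        labels[concept] = any(
            re.search(r"\b" + re.escape(term.lower()) + r"\b", text)
            for term in terms
        )
    return labels


example = "We study early-onset Alzheimer disease in affected families."
print(weak_labels(example))
```

Labels produced this way are noisy, which is why they serve as weak supervision for training a concept annotator rather than as ground truth.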
iASiS Open Data Graph: Automated Semantic Integration of Disease-Specific Knowledge
In biomedical research, unified access to up-to-date domain-specific
knowledge is crucial, as such knowledge is continuously accumulated in
scientific literature and structured resources. Identifying and extracting
specific information is a challenging task and computational analysis of
knowledge bases can be valuable in this direction. However, for
disease-specific analyses researchers often need to compile their own datasets,
integrating knowledge from different resources, or reuse existing datasets,
that can be out-of-date. In this study, we propose a framework to automatically
retrieve and integrate disease-specific knowledge into an up-to-date semantic
graph, the iASiS Open Data Graph. This disease-specific semantic graph provides
access to knowledge relevant to specific concepts and their individual aspects,
in the form of concept relations and attributes. The proposed approach is
implemented as an open-source framework and applied to three diseases (Lung
Cancer, Dementia, and Duchenne Muscular Dystrophy). Exemplary queries are
presented, investigating the potential of this automatically generated semantic
graph as a basis for retrieval and analysis of disease-specific knowledge.
Comment: 6 pages, 2 figures, accepted in IEEE 33rd International Symposium on
Computer Based Medical Systems (CBMS2020)
Large-scale fine-grained semantic indexing of biomedical literature based on weakly-supervised deep learning
Semantic indexing of biomedical literature is usually done at the level of
MeSH descriptors, representing topics of interest for the biomedical community.
Several related but distinct biomedical concepts are often grouped together in
a single coarse-grained descriptor and are treated as a single topic for
semantic indexing. This study proposes a new method for the automated
refinement of subject annotations at the level of concepts, investigating deep
learning approaches. Lacking labelled data for this task, our method relies on
weak supervision based on concept occurrence in the abstract of an article. The
proposed approach is evaluated on an extended large-scale retrospective
scenario, taking advantage of concepts that eventually become MeSH descriptors,
for which annotations become available in MEDLINE/PubMed. The results suggest
that concept occurrence is a strong heuristic for automated subject annotation
refinement and can be further enhanced when combined with dictionary-based
heuristics. In addition, such heuristics can be useful as weak supervision for
developing deep learning models that can achieve further improvement in some
cases.
Comment: 48 pages, 5 figures, 9 tables, 1 algorithm
Overview of BioASQ 2023: The eleventh BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering
This is an overview of the eleventh edition of the BioASQ challenge in the
context of the Conference and Labs of the Evaluation Forum (CLEF) 2023. BioASQ
is a series of international challenges promoting advances in large-scale
biomedical semantic indexing and question answering. This year, BioASQ
consisted of new editions of the two established tasks b and Synergy, and a new
task (MedProcNER) on semantic annotation of clinical content in Spanish with
medical procedures, which have a critical role in medical practice. In this
edition of BioASQ, 28 competing teams submitted the results of more than 150
distinct systems in total for the three different shared tasks of the
challenge. Similarly to previous editions, most of the participating systems
achieved competitive performance, suggesting the continuous advancement of the
state-of-the-art in the field.
Comment: 24 pages, 12 tables, 3 figures. CLEF2023. arXiv admin note: text
overlap with arXiv:2210.0685
Overview of BioASQ 2021-MESINESP track. Evaluation of advance hierarchical classification techniques for scientific literature, patents and clinical trials
CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
There is a pressing need to exploit recent advances in natural language processing technologies, in
particular language models and deep learning approaches, to enable improved retrieval, classification
and ultimately access to information contained in multiple, heterogeneous types of documents. This is
particularly true for the field of biomedicine and clinical research, where medical experts and scientists
need to carry out complex search queries against a variety of document collections, including literature,
patents, clinical trials or other kinds of content such as EHRs. Indexing documents with structured controlled
vocabularies used for semantic search engines and query expansion purposes is a critical task for enabling
sophisticated user queries and even cross-language retrieval. Due to the complexity of the medical domain
and the use of very large hierarchical indexing terminologies, implementing efficient automatic systems
to aid manual indexing is extremely difficult. This paper provides a summary of the MESINESP task
results on medical semantic indexing in Spanish (BioASQ/ CLEF 2021 Challenge). MESINESP was carried
out in direct collaboration with literature content databases and medical indexing experts using the DeCS
vocabulary, a resource similar to MeSH. Seven participating teams used advanced technologies
including extreme multilabel classification and deep language models to solve this challenge, which can
be viewed as a multi-label classification problem. As part of the MESINESP resources, we have released a Gold Standard
collection of 243,000 documents with a total of 2179 manual annotations divided into training, development
and test subsets covering literature, patents as well as clinical trial summaries, under a cross-genre
training and data labeling scenario. Manual indexing of the evaluation subsets was carried out by three
independent experts using a specially developed indexing interface called ASIT. Additionally, we have
published a collection of large-scale automatic semantic annotations based on NER systems of these
documents with mentions of drugs/medications (170,000), symptoms (137,000), diseases (840,000) and
clinical procedures (415,000). In addition to a summary of the technologies used by the teams, this paper
BioASQ at CLEF2022: the tenth edition of the large-scale biomedical semantic indexing and question answering challenge
The tenth version of the BioASQ Challenge will be held as an evaluation Lab within CLEF2022. The motivation driving BioASQ is the continuous advancement of approaches and tools to meet the need for efficient and precise access to the ever-increasing biomedical knowledge. In this direction, a series of annual challenges are organized, in the fields of large-scale biomedical semantic indexing and question answering, formulating specific shared-tasks in alignment with the real needs of the biomedical experts. These shared-tasks and their accompanying benchmark datasets provide a unique common testbed for investigating and comparing new approaches developed by distinct teams around the world for identifying and accessing biomedical information. In particular, the BioASQ Challenge consists of shared-tasks in two complementary directions: (a) the automated indexing of large volumes of unlabelled biomedical documents, primarily scientific publications, with biomedical concepts, and (b) the automated retrieval of relevant material for biomedical questions and the generation of comprehensible answers. In the first direction on semantic indexing, two shared-tasks are organized for English and Spanish content respectively, the latter considering human-interpretable evidence extraction (NER and concept linking) as well. In the second direction, two shared-tasks are organized as well, one for biomedical question answering and one particularly focusing on the developing issue of COVID-19. As BioASQ rewards the approaches that manage to outperform the state of the art in these shared-tasks, the research frontier is pushed towards ensuring that the valuable biomedical knowledge will be identifiable and accessible by the biomedical experts.
Google was a proud sponsor of the BioASQ Challenge in 2021. The tenth edition of BioASQ is also sponsored by Atypon Systems Inc.
The DisTEMIST task is supported by the Spanish Plan for the Advancement of Language Technologies (Plan TL), the 2020 Proyectos de I+D+i-RTI Tipo A (Descifrando El Papel De Las Profesiones En La Salud De Los Pacientes A Traves De La Mineria De Textos, PID2020-119266RA-I00), and HORIZON-CL4-2021-RESILIENCE-01 (BIOMAT+, 101058779).
Peer Reviewed. Postprint (author's final draft)
BioASQ at CLEF2023: The Eleventh Edition of the Large-Scale Biomedical Semantic Indexing and Question Answering Challenge
The large-scale biomedical semantic indexing and question-answering challenge (BioASQ) aims at the continuous advancement of methods and tools to meet the need of biomedical researchers and practitioners for efficient and precise access to the ever-increasing resources of their domain. With this purpose, during the last ten years a series of annual challenges have been organized with specific shared tasks on large-scale biomedical semantic indexing and question answering. Benchmark datasets have been concomitantly provided in alignment with the real needs of biomedical experts. BioASQ provides a unique common testbed where different teams around the world can investigate and compare new approaches for identifying and accessing biomedical knowledge. The eleventh version of the BioASQ Challenge will be held as an evaluation Lab within CLEF2023. In this version, three shared tasks will be presented: (i) the automated retrieval of relevant material for biomedical questions, and the generation of comprehensible answers; (ii) the synergistic retrieval of relevant material and generation of answers for open biomedical questions about developing topics, in collaboration with the experts posing the questions; and (iii) the automated indexing of unlabelled clinical procedures-specific medical documents, primarily clinical case reports written in Spanish, with biomedical concepts and the extraction of human-interpretable evidence. As BioASQ rewards the methods that outperform the state of the art in these shared tasks, it pushes the research frontier towards approaches that accelerate access to biomedical knowledge.
Google was a proud sponsor of the BioASQ Challenge in 2022. The eleventh edition of BioASQ is also sponsored by Atypon Systems Inc.
The task MedProcNER is supported by the Spanish Plan for the Advancement of Language Technologies (Plan TL), the 2020 Proyectos de I+D+i-RTI Tipo A (Descifrando El Papel De Las Profesiones En La Salud De Los Pacientes A Traves De La Mineria De Textos, PID2020-119266RA-I00). This project has received funding from the European Union Horizon Europe Coordination and Support Action under Grant Agreement No 101058779 (BIOMATDB) and DataTools4Heart - DT4H, Grant agreement No 101057849.
Peer Reviewed. Postprint (author's final draft)